The Pursuit: Unraveling Homicides¶

PROJECT OVERVIEW¶

Introduction :¶

Our study aims to provide a thorough analysis of homicide report data in order to better understand the context, trends, and underlying causes of homicide within a particular location and across multiple regions. The analysis covers a number of homicide-related issues, including victim characteristics, crime scenes, suspects' motives, and the influence of societal, economic, and environmental factors.

Homicide is a serious social problem with ramifications across society. It affects not only the criminal justice system but also critically affects social and public health. The focus of our study is to look deeply into homicide report data, which normally contains details about each homicide case, including the date, time, location, victim's characteristics, suspect information, cause of death, and, in some circumstances, the crime's motivation.


FILL IN HERE "Introduction (6 points)

An introduction that sets the stage for your analysis, describes the questions you attempted to answer at a high level, explains why the answers to the question matter, and provides a detailed description of the data set(s) and how they were acquired. Do not just copy the introduction from your proposal/project update. The final report introduction must be based on the report you are presenting."

FILL IN HERE "Choice for Heavier Grading on Data Processing or Data Analysis (1 point)

A description of whether your project should be graded more heavily on data processing or data analysis. You must choose one option or the other. If you select data processing, clearly describe why you believe the work you did goes above and beyond basic data processing needed for most data sets. If you select data analysis, clearly describe why you believe the work you did goes above and beyond basic data analysis needed to answer your questions."

Data Acquisition :¶

We start by importing data from Kaggle, carefully reviewing the CSV file to make sure it is appropriate for our analysis environment. We then turn to data cleaning and missing-data management: a comprehensive review that includes finding and addressing missing or unnecessary information and removing duplicate records. Before transitioning into exploratory data analysis, we perform basic evaluations to understand data distributions, identify outliers, and display patterns that serve as a basis for deeper insights. We investigate victim age, sex, weapon type, perpetrator sex, and victim race using the analytical features of libraries such as Pandas and NumPy. The age distribution is plotted, and gender, weapon use, and perpetrator demographics are explored through filtering. Finally, we use Matplotlib's data visualization features to clarify the weapon and victim race distributions in the dataset.

Importing Libraries :¶

In [1]:
# Import all required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from numpy import nan as NA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
import plotly.express as px
import scipy
import plotly.graph_objects as go

Loading Dataset :¶

In [2]:
# Load our CSV file into a dataframe for processing and analysis
path = ''
homicides_df = pd.read_csv(path + 'database_Homicide.csv', dtype = { 16: 'str' })

As the Perpetrator Age column contains values of multiple datatypes, we import it as a string.
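As a quick illustration (with a hypothetical three-row CSV, not the real file), reading a mixed-type column as text and converting it explicitly avoids silent type surprises:

```python
import io
import pandas as pd

# Hypothetical two-column CSV in the same shape as the Kaggle file
csv_text = "Record ID,Perpetrator Age\n1,15\n2,Unknown\n3,42\n"

# Reading the mixed column as str keeps every raw value intact
df = pd.read_csv(io.StringIO(csv_text), dtype={'Perpetrator Age': 'str'})

# Conversion can then happen explicitly, with bad values coerced to NaN
ages = pd.to_numeric(df['Perpetrator Age'], errors='coerce')
```

The non-numeric entry becomes NaN rather than raising an error or forcing the whole column to text.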

In [3]:
# Review dataset
homicides_df.head(10)
Out[3]:
Record ID Agency Code Agency Name Agency Type City State Year Month Incident Crime Type ... Victim Ethnicity Perpetrator Sex Perpetrator Age Perpetrator Race Perpetrator Ethnicity Relationship Weapon Victim Count Perpetrator Count Record Source
0 1 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 January 1 Murder or Manslaughter ... Unknown Male 15 Native American/Alaska Native Unknown Acquaintance Blunt Object 0 0 FBI
1 2 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 1 Murder or Manslaughter ... Unknown Male 42 White Unknown Acquaintance Strangulation 0 0 FBI
2 3 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 2 Murder or Manslaughter ... Unknown Unknown 0 Unknown Unknown Unknown Unknown 0 0 FBI
3 4 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 1 Murder or Manslaughter ... Unknown Male 42 White Unknown Acquaintance Strangulation 0 0 FBI
4 5 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 2 Murder or Manslaughter ... Unknown Unknown 0 Unknown Unknown Unknown Unknown 0 1 FBI
5 6 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 May 1 Murder or Manslaughter ... Unknown Male 36 White Unknown Acquaintance Rifle 0 0 FBI
6 7 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 May 2 Murder or Manslaughter ... Unknown Male 27 Black Unknown Wife Knife 0 0 FBI
7 8 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 1 Murder or Manslaughter ... Unknown Male 35 White Unknown Wife Knife 0 0 FBI
8 9 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 2 Murder or Manslaughter ... Unknown Unknown 0 Unknown Unknown Unknown Firearm 0 0 FBI
9 10 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 3 Murder or Manslaughter ... Unknown Male 40 Unknown Unknown Unknown Firearm 0 1 FBI

10 rows × 24 columns

In [4]:
# Checking our data rows and columns
print(f'No. of rows: {homicides_df.shape[0]}')
print(f'No. of columns: {homicides_df.shape[1]}')
No. of rows: 638454
No. of columns: 24
In [6]:
# Load the US cities CSV into a dataframe to supply location data
path = ''
locations_df = pd.read_csv(path + 'uscities.csv', usecols = [0, 3, 6, 7])
In [7]:
locations_df = locations_df.rename(columns = {'city': 'City', 'lat': 'Latitude', 'lng': 'Longitude', 'state_name': 'State'})
In [8]:
locations_df.head(10)
Out[8]:
City State Latitude Longitude
0 New York New York 40.6943 -73.9249
1 Los Angeles California 34.1141 -118.4068
2 Chicago Illinois 41.8375 -87.6866
3 Miami Florida 25.7840 -80.2101
4 Houston Texas 29.7860 -95.3885
5 Dallas Texas 32.7935 -96.7667
6 Philadelphia Pennsylvania 40.0077 -75.1339
7 Atlanta Georgia 33.7628 -84.4220
8 Washington District of Columbia 38.9047 -77.0163
9 Boston Massachusetts 42.3188 -71.0852

Data Cleaning :¶

In [9]:
# Dropping some columns that we will not use for later analysis
homicides_df.drop(columns=['Record Source', 'Perpetrator Race', 'Perpetrator Ethnicity'], inplace=True)

We drop these columns to streamline the dataset, improve computational performance, cut down on redundancy, and concentrate on the features most important for the later phases of analysis. This reflects a planned, focused approach, highlighting the value of feature selection and simplified data.

In [10]:
# checking to see if we have any duplicates
print(f'No. of duplicate rows: {homicides_df.duplicated().sum()}')
No. of duplicate rows: 0

There are no duplicated records.

In [11]:
# Using this to check for missing/blank values in the dataset column wise
missing_vals = homicides_df.isnull().sum()
missing_vals
Out[11]:
Record ID            0
Agency Code          0
Agency Name          0
Agency Type          0
City                 0
State                0
Year                 0
Month                0
Incident             0
Crime Type           0
Crime Solved         0
Victim Sex           0
Victim Age           0
Victim Race          0
Victim Ethnicity     0
Perpetrator Sex      0
Perpetrator Age      0
Relationship         0
Weapon               0
Victim Count         0
Perpetrator Count    0
dtype: int64

We have found no missing values

While reviewing the dataframe we found that instead of NaN we have 'Unknown' as the missing value. So, let's replace it.
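On a toy frame (hypothetical values, in the same style as the dataset), the replacement looks like this:

```python
import numpy as np
import pandas as pd

# Toy frame using the same 'Unknown' sentinel the dataset uses
toy = pd.DataFrame({'Victim Sex': ['Male', 'Unknown', 'Female'],
                    'Weapon': ['Knife', 'Handgun', 'Unknown']})

# replace() swaps the sentinel for a true missing value in every column at once
cleaned = toy.replace('Unknown', np.nan)
```

After the swap, `isnull()`-based tooling (counts, imputation, dropna) sees these cells as genuinely missing.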

In [12]:
# Find count of 'Unknown' in each column
unknown_counts = {}
for column in homicides_df.columns:
    unknown_counts[column] = (homicides_df[column] == 'Unknown').sum()

print(unknown_counts)
{'Record ID': 0, 'Agency Code': 0, 'Agency Name': 47, 'Agency Type': 0, 'City': 0, 'State': 0, 'Year': 0, 'Month': 0, 'Incident': 0, 'Crime Type': 0, 'Crime Solved': 0, 'Victim Sex': 984, 'Victim Age': 0, 'Victim Race': 6676, 'Victim Ethnicity': 368303, 'Perpetrator Sex': 190365, 'Perpetrator Age': 0, 'Relationship': 273013, 'Weapon': 33192, 'Victim Count': 0, 'Perpetrator Count': 0}
In [13]:
# Replace 'Unknown' with Nan
homicides_df.replace('Unknown', NA, inplace=True)
homicides_df.head(10)
Out[13]:
Record ID Agency Code Agency Name Agency Type City State Year Month Incident Crime Type ... Victim Sex Victim Age Victim Race Victim Ethnicity Perpetrator Sex Perpetrator Age Relationship Weapon Victim Count Perpetrator Count
0 1 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 January 1 Murder or Manslaughter ... Male 14 Native American/Alaska Native NaN Male 15 Acquaintance Blunt Object 0 0
1 2 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 1 Murder or Manslaughter ... Male 43 White NaN Male 42 Acquaintance Strangulation 0 0
2 3 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 2 Murder or Manslaughter ... Female 30 Native American/Alaska Native NaN NaN 0 NaN NaN 0 0
3 4 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 1 Murder or Manslaughter ... Male 43 White NaN Male 42 Acquaintance Strangulation 0 0
4 5 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 2 Murder or Manslaughter ... Female 30 Native American/Alaska Native NaN NaN 0 NaN NaN 0 1
5 6 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 May 1 Murder or Manslaughter ... Male 30 White NaN Male 36 Acquaintance Rifle 0 0
6 7 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 May 2 Murder or Manslaughter ... Female 42 Native American/Alaska Native NaN Male 27 Wife Knife 0 0
7 8 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 1 Murder or Manslaughter ... Female 99 White NaN Male 35 Wife Knife 0 0
8 9 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 2 Murder or Manslaughter ... Male 32 White NaN NaN 0 NaN Firearm 0 0
9 10 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 3 Murder or Manslaughter ... Male 38 White NaN Male 40 NaN Firearm 0 1

10 rows × 21 columns

We know that the Perpetrator Age column has values in string format. Let's convert it to a numeric type so we can perform calculations easily.

In [14]:
# to change the type of perpetrator age
homicides_df['Perpetrator Age'] = pd.to_numeric(homicides_df['Perpetrator Age'], errors='coerce')

We found many outliers based on age. We want to remove them for both Victim and Perpetrator.
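The three-standard-deviation rule can be packaged as a small helper; this is a sketch with toy data, and `trim_sigma` is our own name rather than anything from the notebook:

```python
import pandas as pd

def trim_sigma(df, column, k=3):
    """Keep only rows whose value in `column` lies within k standard deviations of the mean."""
    mean = df[column].mean()
    std = df[column].std()
    # between() drops NaN rows too, since NaN comparisons are False
    return df[df[column].between(mean - k * std, mean + k * std)]

# Toy ages: 90 plausible values plus one extreme outlier
toy = pd.DataFrame({'Age': [20] * 30 + [30] * 30 + [40] * 30 + [999]})
trimmed = trim_sigma(toy, 'Age')
```

Note that `between` uses inclusive bounds while the notebook's filter below is strict; for outlier trimming the difference is immaterial.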

Let's do for perpetrator first.

In [15]:
#Calculate the mean and standard deviation of the column
mean = homicides_df['Perpetrator Age'].mean()
std_dev = homicides_df['Perpetrator Age'].std()

# Define a threshold (for instance, outliers beyond 3 standard deviations)
threshold = 3

# Filter the DataFrame to exclude values beyond the threshold
homicides_df = homicides_df[(homicides_df['Perpetrator Age'] < (mean + threshold * std_dev)) & (homicides_df['Perpetrator Age'] > (mean - threshold * std_dev))]
homicides_df['Perpetrator Age'].describe()
Out[15]:
count    634745.000000
mean         19.971691
std          17.332278
min           0.000000
25%           0.000000
50%          21.000000
75%          31.000000
max          73.000000
Name: Perpetrator Age, dtype: float64

Let's do it for Victim's age now.

In [16]:
#Calculate the mean and standard deviation of the column
mean = homicides_df['Victim Age'].mean()
std_dev = homicides_df['Victim Age'].std()

# Define a threshold (for instance, outliers beyond 3 standard deviations)
threshold = 3

# Filter the DataFrame to exclude values beyond the threshold
homicides_df = homicides_df[(homicides_df['Victim Age'] < (mean + threshold * std_dev)) & (homicides_df['Victim Age'] > (mean - threshold * std_dev))]
homicides_df['Victim Age'].describe()
Out[16]:
count    633775.000000
mean         33.386476
std          17.624169
min           0.000000
25%          22.000000
50%          30.000000
75%          41.000000
max          99.000000
Name: Victim Age, dtype: float64
In [17]:
(homicides_df['Victim Age'] == 0).sum()
Out[17]:
8442
In [18]:
(homicides_df['Perpetrator Age'] == 0).sum()
Out[18]:
215687

As an age of 0 in this dataset stands in for an unknown value rather than a real age, we replace 0 with NaN.
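On toy values (hypothetical ages; `recoded` is our own name), the recoding looks like this:

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 15, 0, 42], dtype=float)

# np.where keeps real ages and turns the 0 sentinel into NaN
recoded = pd.Series(np.where(ages == 0, np.nan, ages))
```

The recoded values then drop out of `count`, `mean`, and the quartiles automatically, which is why the describe() summaries change below.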

In [19]:
homicides_df['Victim Age'] = np.where(homicides_df['Victim Age'] == 0, np.nan, homicides_df['Victim Age'])
homicides_df['Victim Age'].describe()
Out[19]:
count    625333.000000
mean         33.837194
std          17.307616
min           1.000000
25%          22.000000
50%          30.000000
75%          42.000000
max          99.000000
Name: Victim Age, dtype: float64
In [20]:
homicides_df['Perpetrator Age'] = np.where(homicides_df['Perpetrator Age'] == 0, np.nan, homicides_df['Perpetrator Age'])
homicides_df['Perpetrator Age'].describe()
Out[20]:
count    418088.000000
mean         30.298043
std          11.954250
min           1.000000
25%          21.000000
50%          27.000000
75%          37.000000
max          73.000000
Name: Perpetrator Age, dtype: float64

Transform Dataset :¶

In [21]:
homicides_df = pd.merge(homicides_df, locations_df, on = ['City','State'], how='left')
homicides_df.head(10)
Out[21]:
Record ID Agency Code Agency Name Agency Type City State Year Month Incident Crime Type ... Victim Race Victim Ethnicity Perpetrator Sex Perpetrator Age Relationship Weapon Victim Count Perpetrator Count Latitude Longitude
0 1 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 January 1 Murder or Manslaughter ... Native American/Alaska Native NaN Male 15.0 Acquaintance Blunt Object 0 0 61.1508 -149.1091
1 2 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 1 Murder or Manslaughter ... White NaN Male 42.0 Acquaintance Strangulation 0 0 61.1508 -149.1091
2 3 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 2 Murder or Manslaughter ... Native American/Alaska Native NaN NaN NaN NaN NaN 0 0 61.1508 -149.1091
3 4 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 1 Murder or Manslaughter ... White NaN Male 42.0 Acquaintance Strangulation 0 0 61.1508 -149.1091
4 5 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 2 Murder or Manslaughter ... Native American/Alaska Native NaN NaN NaN NaN NaN 0 1 61.1508 -149.1091
5 6 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 May 1 Murder or Manslaughter ... White NaN Male 36.0 Acquaintance Rifle 0 0 61.1508 -149.1091
6 7 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 May 2 Murder or Manslaughter ... Native American/Alaska Native NaN Male 27.0 Wife Knife 0 0 61.1508 -149.1091
7 8 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 1 Murder or Manslaughter ... White NaN Male 35.0 Wife Knife 0 0 61.1508 -149.1091
8 9 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 2 Murder or Manslaughter ... White NaN NaN NaN NaN Firearm 0 0 61.1508 -149.1091
9 10 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 3 Murder or Manslaughter ... White NaN Male 40.0 NaN Firearm 0 1 61.1508 -149.1091

10 rows × 23 columns
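A sketch of the left-join semantics with toy frames (hypothetical rows): records whose city is missing from the locations table keep NaN coordinates instead of being dropped.

```python
import pandas as pd

crimes = pd.DataFrame({'City': ['Anchorage', 'Springfield'],
                       'State': ['Alaska', 'Illinois']})
coords = pd.DataFrame({'City': ['Anchorage'], 'State': ['Alaska'],
                       'Latitude': [61.1508], 'Longitude': [-149.1091]})

# how='left' keeps every crime record; unmatched cities get NaN coordinates
merged = pd.merge(crimes, coords, on=['City', 'State'], how='left')
```

This is why the merge cannot lose homicide records, though cities absent from uscities.csv will lack map coordinates.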

Index Dataset :¶

In [22]:
# Setting the index as the 'Record ID' column for easier viewing
homicides_df.set_index('Record ID', inplace = True)
homicides_df.head(10)
Out[22]:
Agency Code Agency Name Agency Type City State Year Month Incident Crime Type Crime Solved ... Victim Race Victim Ethnicity Perpetrator Sex Perpetrator Age Relationship Weapon Victim Count Perpetrator Count Latitude Longitude
Record ID
1 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 January 1 Murder or Manslaughter Yes ... Native American/Alaska Native NaN Male 15.0 Acquaintance Blunt Object 0 0 61.1508 -149.1091
2 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 1 Murder or Manslaughter Yes ... White NaN Male 42.0 Acquaintance Strangulation 0 0 61.1508 -149.1091
3 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 March 2 Murder or Manslaughter No ... Native American/Alaska Native NaN NaN NaN NaN NaN 0 0 61.1508 -149.1091
4 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 1 Murder or Manslaughter Yes ... White NaN Male 42.0 Acquaintance Strangulation 0 0 61.1508 -149.1091
5 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 April 2 Murder or Manslaughter No ... Native American/Alaska Native NaN NaN NaN NaN NaN 0 1 61.1508 -149.1091
6 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 May 1 Murder or Manslaughter Yes ... White NaN Male 36.0 Acquaintance Rifle 0 0 61.1508 -149.1091
7 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 May 2 Murder or Manslaughter Yes ... Native American/Alaska Native NaN Male 27.0 Wife Knife 0 0 61.1508 -149.1091
8 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 1 Murder or Manslaughter Yes ... White NaN Male 35.0 Wife Knife 0 0 61.1508 -149.1091
9 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 2 Murder or Manslaughter No ... White NaN NaN NaN NaN Firearm 0 0 61.1508 -149.1091
10 AK00101 Anchorage Municipal Police Anchorage Alaska 1980 June 3 Murder or Manslaughter Yes ... White NaN Male 40.0 NaN Firearm 0 1 61.1508 -149.1091

10 rows × 22 columns

The code restructures the DataFrame so that the 'Record ID' values serve as the row identifiers by assigning the 'Record ID' column to the index. This change improves the simplicity and clarity of data exploration: with a unique identifier as the index, we can fetch individual entries directly, make data retrieval more efficient, and make the DataFrame easier to read overall.
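As a sketch of what the new index buys us (toy rows, hypothetical weapon values), rows can be fetched directly by Record ID with `.loc`:

```python
import pandas as pd

toy = pd.DataFrame({'Record ID': [1, 2, 3],
                    'Weapon': ['Blunt Object', 'Strangulation', 'Knife']})
toy.set_index('Record ID', inplace=True)

# Label-based lookup by Record ID, no boolean filtering needed
weapon = toy.loc[2, 'Weapon']
```
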

Exploratory Data Analysis :¶

To get a better overview of the dataset, we will perform exploratory data analysis.

In [23]:
# we are trying to see which of our columns are numerical or categorical for a better understanding of our data
def get_var_category(series):
    unique_count = series.nunique(dropna=False)
    total_count = len(series)
    if pd.api.types.is_numeric_dtype(series):
        return 'Numerical'
    elif pd.api.types.is_datetime64_dtype(series):
        return 'Date'
    elif unique_count==total_count:
        return 'Text (Unique)'
    else:
        return 'Categorical'

def print_categories(df):
    for column_name in df.columns:
        print(column_name, ": ", get_var_category(df[column_name]))
print_categories(homicides_df)
Agency Code :  Categorical
Agency Name :  Categorical
Agency Type :  Categorical
City :  Categorical
State :  Categorical
Year :  Numerical
Month :  Categorical
Incident :  Numerical
Crime Type :  Categorical
Crime Solved :  Categorical
Victim Sex :  Categorical
Victim Age :  Numerical
Victim Race :  Categorical
Victim Ethnicity :  Categorical
Perpetrator Sex :  Categorical
Perpetrator Age :  Numerical
Relationship :  Categorical
Weapon :  Categorical
Victim Count :  Numerical
Perpetrator Count :  Numerical
Latitude :  Numerical
Longitude :  Numerical

The code gives a thorough description of the characteristics of each column in the dataset by differentiating between date, categorical, and numerical variables. This classification is crucial because it provides the framework for customized analytical methods and well-informed decision-making: we are better equipped to use suitable statistical measures, visualization approaches, and modeling tactics when we know whether a variable is numerical, categorical, or temporal. Additionally, the code highlights columns whose text values are all distinct, indicating possible areas that need extra investigation, such as abnormalities or inconsistencies in the data.

In [24]:
# we are trying to describe the age trend of victims from our data set
homicides_df['Victim Age'].describe()
Out[24]:
count    625578.000000
mean         33.838722
std          17.308328
min           1.000000
25%          22.000000
50%          30.000000
75%          42.000000
max          99.000000
Name: Victim Age, dtype: float64

We discover that half of the victims are 30 years of age or younger, with a median age of 30.0 providing a measure of central tendency. The middle 50% of ages lie within the interquartile range (IQR) of 22.0 to 42.0, indicating a wide variety of victim ages. After cleaning, ages span 1 to 99, so the implausible extremes present in the raw data no longer appear. The age distribution is skewed to the right, as indicated by the wider gap between the median and the third quartile (30 to 42) than between the first quartile and the median (22 to 30). This detailed look highlights both the central tendency and the diversity in victim ages, underscoring the importance of considering the larger context and any remaining anomalies in the data in later conclusions.
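The quartile reasoning can be reproduced on a small example (toy ages, not the dataset):

```python
import pandas as pd

# A right-skewed toy sample of ages
ages = pd.Series([18, 22, 25, 30, 35, 42, 60, 75, 99])
q1, q2, q3 = ages.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1

# A longer upper half (q3 - q2 > q2 - q1) is a quick sign of right skew
right_skewed = (q3 - q2) > (q2 - q1)
```
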

In [25]:
# we are trying to describe the age trend of perpetrators from our data set
homicides_df['Perpetrator Age'].describe()
Out[25]:
count    418303.000000
mean         30.299360
std          11.954775
min           1.000000
25%          21.000000
50%          27.000000
75%          37.000000
max          73.000000
Name: Perpetrator Age, dtype: float64

With 418,303 entries recording a known perpetrator age, the dataset shows a broad range of ages, from 1 to 73 years old. The average perpetrator age is roughly 30.3 years, and the standard deviation of about 11.95 indicates considerable variation around this mean. The quartiles point to a right-skewed distribution: 25% of perpetrators are 21 or younger, the median age is 27, and 75% are 37 or younger, so the upper half of the distribution is more spread out than the lower half. The interquartile range of 21 to 37 years shows the wide span of ages among perpetrators. Note that ages originally recorded as 0, which stand in for unknown values, were recoded to NaN and are excluded from these statistics.

Data Analysis & Visualization :¶

Let's take a closer look at the data using some visualizations. We want to answer a variety of questions based on the data, so that we can improve our understanding of the homicide data and maybe even predict some trends. Mainly, we want to be able to view the data in a concise manner and then make some inferences based on our visualizations.

Let's get started!

Question of Interest #1 :¶

What are the most commonly used weapons in homicides and why might this be the case? What are the regions or cities that have the most homicides? Why?

Let us first discuss part one of our question. What are the most common weapons used in homicides and why might this be the case?

In [26]:
# showing the different weapons used
weapon_count = homicides_df['Weapon'].value_counts()
weapon_count
Out[26]:
Weapon
Handgun          315302
Knife             94696
Blunt Object      66953
Firearm           46675
Shotgun           30353
Rifle             23117
Strangulation      8072
Fire               6136
Suffocation        3914
Gun                2186
Drugs              1574
Drowning           1204
Explosives          536
Poison              440
Fall                185
Name: count, dtype: int64

Let's make an observation based on the initial breakdown :¶

Handguns are the most common weapon in these cases, suggesting that they are widely available and versatile. The prominence of knives and blunt objects further down the list emphasizes the role of close-quarters physical aggression. The various firearm categories, including shotguns, rifles, and the generic 'Firearm' and 'Gun' entries, add up to a significant number. Less prevalent methods such as strangulation, fire, suffocation, and drowning also appear, each shedding light on the differing circumstances surrounding some crimes.

Now, let's create a visualization of this!¶

We want to be able to view the most common weapons used in homicides over the years and make an inference based on the visualization.

In [27]:
# Group rare weapon types (below the threshold) into a single 'Other' slice
threshold = 10000
small_counts = weapon_count[weapon_count < threshold]
weapon_count['Other'] = weapon_count[small_counts.index].sum()
weapon_count.drop(small_counts.index, inplace=True)
plt.pie(weapon_count, labels=weapon_count.index)
plt.show()

Let's make an inference based on our visualization :¶

Numerous factors may influence the observed distribution of weapon types in homicides, with handguns clearly leading the way. Handguns are a common choice in criminal activity because of their accessibility, portability, and ease of concealment. The data may also indicate that handguns are widely available in many neighborhoods, which could account for their prevalence in violent situations. Other weapon types, such as knives and blunt objects, appear consistently but never eclipse handguns.
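The small-category bucketing used for the pie chart generalizes to any `value_counts` result; a sketch with toy counts and a hypothetical threshold:

```python
import pandas as pd

# Hypothetical weapon counts
counts = pd.Series({'Handgun': 500, 'Knife': 120, 'Poison': 4, 'Fall': 2})
threshold = 10

# Collapse every category below the threshold into one 'Other' entry
small = counts[counts < threshold]
counts['Other'] = small.sum()
counts = counts.drop(small.index)
```

Bucketing keeps the pie chart legible: tiny slices would otherwise be unreadable and their labels would overlap.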

Let us discuss part two of our question. What are the regions or cities that have the most homicides? Why?

Now, let's create a visualization of this!¶

We want to be able to view the cities with the highest homicide rates and then create an observation and an inference based on what we see.

In [28]:
city_homicide_counts = homicides_df.loc[:,['City','State','Longitude', 'Latitude']]
In [29]:
city_homicide_counts = pd.DataFrame({'Count': city_homicide_counts.groupby(['City','State','Longitude', 'Latitude']).size()}).reset_index()
In [30]:
city_homicide_counts
Out[30]:
City State Longitude Latitude Count
0 Abbeville South Carolina -82.3774 34.1787 59
1 Adair Iowa -94.6434 41.5004 2
2 Adair Oklahoma -95.2734 36.4365 56
3 Adams Illinois -91.1998 39.8708 26
4 Adams Nebraska -96.5129 40.4572 18
... ... ... ... ... ...
947 York Pennsylvania -76.7315 39.9651 406
948 York South Carolina -81.2341 34.9967 345
949 Yuma Arizona -114.5491 32.5995 204
950 Yuma Colorado -102.7161 40.1235 5
951 Zapata Texas -99.2612 26.9026 17

952 rows × 5 columns

In [32]:
colors = ["royalblue", "crimson", "lightseagreen", "orange"]
limits = [(0, 10), (10, 100), (100, 200), (200, 1000)]
scale = 200
fig = go.Figure()

for i in range(len(limits)):
    lim = limits[i]
    # Select the cities whose homicide count falls inside this bin
    df_sub = city_homicide_counts[(city_homicide_counts['Count'] > lim[0]) &
                                  (city_homicide_counts['Count'] <= lim[1])]
    fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = df_sub['Longitude'],
        lat = df_sub['Latitude'],
        text = df_sub['Count'],
        marker = dict(
            size = df_sub['Count']/scale,
            color = colors[i],
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode = 'area'
        ),
        name = '{0} - {1}'.format(lim[0],lim[1])))

fig.update_layout(
        title_text = 'Homicides',
        showlegend = True,
        geo = dict(
            scope = 'usa',
            landcolor = 'rgb(217, 217, 217)',
        )
    )
fig.show()
In [33]:
# Grouping data by cities and counting the number of homicides in each city
city_homicide_counts = homicides_df.groupby('City')['City'].count().sort_values(ascending=False).head(10)

# Plotting the cities with the most homicides
plt.figure(figsize=(12, 6))
city_homicide_counts.plot(kind='bar', color='pink')
plt.title('Top 10 Cities with the Most Homicides')
plt.xlabel('City')
plt.ylabel('Number of Homicides')
plt.xticks(rotation=45)
plt.show()

Let's make an observation based on our visualization :¶

The code classifies the homicide data by city to ascertain the total number of homicides in each. Los Angeles, Chicago, and New York are the three cities with the highest incidence of homicides, and the top 10 cities with the highest counts are prominently displayed in the resulting bar chart. This graphic gives a clear picture of the cities with high homicide counts by highlighting the spatial distribution of violent episodes.

Let's make an inference based on our visualization :¶

A number of intricate socioeconomic and demographic factors may contribute to increased homicide counts in Los Angeles, Chicago, and New York City. Big cities frequently struggle with problems like gang activity, poverty, and a concentration of underprivileged neighborhoods, and these circumstances may produce higher rates of homicide and other crimes. Social unrest and criminal activity may be exacerbated by economic inequality and restricted access to jobs and educational opportunities in some communities. Increased population density, difficulties with law enforcement, and easier access to firearms could all contribute to an increased risk of violent incidents. The convergence of these factors in major metropolitan areas such as Los Angeles, Chicago, and New York highlights the need for targeted and multifaceted approaches to address the underlying causes of violence.

Question of Interest #2 :¶

What are the sex differences between homicide perpetrators and victims and the race breakdowns of victims? How have they changed over time?

Let us first do some basic analysis on perpetrator sex versus victim sex.

In [34]:
# creating homicide counts based on victim sex to see the distribution
victim_sex_count = homicides_df['Victim Sex'].value_counts()
victim_sex_count
Out[34]:
Victim Sex
Male      492675
Female    141085
Name: count, dtype: int64

With 492,675 recorded occurrences, the data shows that men make up a significant majority of the victims, greatly exceeding the 141,085 recorded female victims. This notable gap between the number of victims by gender raises the question of what factors are at play and calls for more research into the demographics of homicide incidents. Understanding gender dynamics is imperative for formulating focused interventions and policies that target and prevent violence, and it underscores the importance of accurate and thorough data collection in crime reporting.

In [35]:
# creating homicides count based on perpetrator sex to compare between the two
perpetrator_sex_count = homicides_df['Perpetrator Sex'].value_counts()
perpetrator_sex_count
Out[35]:
Perpetrator Sex
Male      396070
Female     48220
Name: count, dtype: int64

Remarkably, male perpetrators, at 396,070 occurrences, vastly outnumber female perpetrators, who account for only 48,220 cases.

Let's make an observation based on the initial breakdown :¶

Men make up a significant majority of both victims (492,675, versus 141,085 female victims) and perpetrators (396,070, versus 48,220 female perpetrators). The dataset has limitations because of inadequate or non-disclosed gender information in some cases: before cleaning, 984 victim-sex records and 190,365 perpetrator-sex records carried the 'Unknown' sentinel, now recoded to NaN. The large unknown share on the perpetrator side is a significant obstacle to fully understanding the gender distribution and raises concerns about reporting procedures or the constraints of data collection.

Now, let's create a visualization of this!¶

We want to be able to view the breakdown between perpetrator sex and victim sex and create an inference based on the visualization.

In [36]:
x = homicides_df['Victim Sex'].value_counts().index
y1 = homicides_df['Victim Sex'].value_counts()
# Reindex so perpetrator counts line up with the victim categories.
y2 = homicides_df['Perpetrator Sex'].value_counts().reindex(x, fill_value=0)

victim_color = 'pink'
perpetrator_color = 'lavender'

sns.set_style("whitegrid")
plt.figure(figsize=(10, 6))

for i, gender in enumerate(x):
    total_count = y1.iloc[i] + y2.iloc[i]
    victim_percentage = y1.iloc[i] / total_count * 100
    perpetrator_percentage = y2.iloc[i] / total_count * 100

    plt.bar(gender, y1.iloc[i], color=victim_color, label="Victim" if i == 0 else "")
    plt.bar(gender, y2.iloc[i], bottom=y1.iloc[i], color=perpetrator_color, label="Perpetrator" if i == 0 else "")

    plt.text(i, y1.iloc[i] / 2, f'{victim_percentage:.1f}%', ha='center', va='center', color='black', fontsize=14)
    plt.text(i, y1.iloc[i] + y2.iloc[i] / 2, f'{perpetrator_percentage:.1f}%', ha='center', va='center', color='black', fontsize=14)

plt.xlabel('Gender', fontsize=14)
plt.ylabel('Number of Cases', fontsize=14)
plt.title('Victim and Perpetrator Counts by Gender', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.despine(top=True, right=True)

plt.legend(title="Role", labels=["Victim", "Perpetrator"], fontsize=12)

plt.tight_layout()
plt.show()

Let's make an inference based on our visualization :¶

The stacked bar chart shows the gender distribution of homicide victims and perpetrators. Males predominate in both roles, with far greater counts than females; this could reflect underlying criminological and social trends. Historical crime statistics have consistently shown that men are more likely than women to be involved in violent crime, both as offenders and victims, possibly due to socialization, cultural norms, and risk factors that are more common among men. Furthermore, the noteworthy number of cases where gender is marked as 'Unknown' highlights difficulties in data collection and reporting, pointing either to gaps in the investigation process or to limits on available resources.

Next, let's do some basic analysis on victim race and make an observation.

In [37]:
# creating a count based on race to see who is affected the most
victim_race_count = homicides_df['Victim Race'].value_counts()
victim_race_count
Out[37]:
Victim Race
White                            314796
Black                            298895
Asian/Pacific Islander             9831
Native American/Alaska Native      4563
Name: count, dtype: int64

Let's make an observation based on the initial breakdown :¶

According to the statistics, 314,796 incidents are classified as White victims and 298,895 as Black victims, making White victims the largest single group. Far fewer victims are reported in the remaining categories: 9,831 Asian/Pacific Islander and 4,563 Native American/Alaska Native. This breakdown highlights the differences in racial backgrounds among victims.
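Shares make the two-group dominance clearer than raw counts. The sketch below copies the counts from the output above so it runs standalone; in the notebook, `value_counts(normalize=True)` on the `Victim Race` column would do the same.

```python
import pandas as pd

# Victim race counts copied from the value_counts() output above.
race_counts = pd.Series({
    'White': 314796,
    'Black': 298895,
    'Asian/Pacific Islander': 9831,
    'Native American/Alaska Native': 4563,
})

race_share = (race_counts / race_counts.sum() * 100).round(1)
print(race_share)  # White 50.1%, Black 47.6%, remaining groups under 2% combined
```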

Now, let's create a visualization of this!¶

We want to be able to view the breakdown of victim race, so that we can make an inference of why certain populations are more affected.

In [38]:
# We would like to show the distribution of victim race,
# grouping small categories into 'Other'.
threshold = 10000
small_counts = victim_race_count[victim_race_count < threshold]
race_plot_counts = victim_race_count.drop(small_counts.index)
race_plot_counts['Other'] = small_counts.sum()

fig, ax = plt.subplots()
ax.pie(race_plot_counts, labels=race_plot_counts.index,
       autopct='%.1f%%', colors=['pink', 'lavender', 'black'])
plt.show()

Let's make an inference based on our visualization :¶

The victim race distribution shows racial disparities in homicide victimization, with White and Black people having the greatest counts. Several factors can contribute to this distribution. Certain racial and ethnic groups, especially Black communities, may be disproportionately exposed to higher rates of crime and violence due to socioeconomic conditions and social inequalities. Victimization risk may be exacerbated by historical discrimination, residential segregation, and restricted access to resources. In addition, variations in law enforcement tactics and the prevalence of gang-associated violence may influence the distribution.

In [39]:
# Sex differences between homicide perpetrators and victims
sex_diff_df = homicides_df.groupby(['Year', 'Victim Sex', 'Perpetrator Sex']).size().unstack().fillna(0)
sex_diff_df.plot(kind='bar', stacked=True, figsize=(15, 7))
plt.title('Sex Differences Between Homicide Perpetrators and Victims Over Time')
plt.xlabel('Year, Victim Sex')
plt.ylabel('Number of Cases')
plt.show()
In [40]:
# Race
race_breakdown_df = homicides_df.groupby(['Year', 'Victim Race']).size().unstack().fillna(0)
race_breakdown_df.plot(kind='bar', stacked=True, figsize=(20, 7))
plt.title('Race Breakdowns of Victims Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Cases')
plt.show()

Question of Interest #3 :¶

Are there periods when the number of homicides significantly rises or falls, and if so, what may have caused these changes, such as policy modifications, economic shifts, or adjustments to law enforcement tactics? Do recessions or other periods of economic hardship show a meaningful relationship with homicide frequency? Does the frequency vary by season?

First, let's take a look at the trend of homicides over the years and make an observation and inference based on our visualization.

Homicide Frequency Over the Years :¶

In [41]:
# Group homicides by Year for analysis.
homicide_by_year = homicides_df.reset_index().groupby(by='Year')['Record ID'].count().reset_index()
homicide_by_year.rename(columns={'Record ID': 'Count'}, inplace=True)

# Define X-values and Y-values.
x_val = homicide_by_year['Year']
y_val = homicide_by_year['Count']

# Creating the chart of Years vs Count of Homicides.
fig = go.Figure()

# Add a trace for each data point to show year and count.
fig.add_trace(go.Scatter(x=x_val, y=y_val,
                         mode='lines+markers', marker=dict(size=8, color='blue'), line=dict(width=2),
                         hovertemplate='<b>Year:</b> %{x}<br><b>Count:</b> %{y}'))

# Customizing layout and style of the chart.
fig.update_layout(
    title='Homicide frequency by Year',
    xaxis=dict(title='Year', tickmode='linear', dtick=1),
    yaxis=dict(title='Count of Homicides'),
    hoverlabel=dict(bgcolor='white', font_size=12, font_family='Arial'),
    font=dict(family='Arial', size=14, color='black'),
    plot_bgcolor='rgba(240,240,240,0.7)',
)

# Show the chart.
fig.show()

Let's make an observation based on our visualization :¶

The annual homicide frequency graph highlights a significant trend between 1990 and 2015. The graph reveals a peak in the number of homicides between 1990 and 1995, demonstrating a considerable rise in violent incidents during that time. There is then a noticeable decline between 2000 and 2015, indicating a general drop in violent crime over that period.

Let's make an inference based on our visualization :¶

After peaking between 1990 and 1995, the observed reduction in homicide frequency from 2000 to 2015 may stem from a combination of factors. Changes in law enforcement tactics and policies may have played a large role in suppressing crime rates during this time, including targeted interventions in high-crime areas, increased community policing, and technological improvements in crime prevention and detection. Furthermore, broader social and economic conditions may have improved, as lower poverty rates, better access to education, and stronger economic prospects can all reduce crime. Community-based activities such as anti-violence campaigns and rehabilitation programs may also have helped to lower the number of violent events.
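One way to quantify how steep the decline is would be a year-over-year percent change on the yearly counts. The sketch below uses hypothetical values purely for illustration; the real numbers come from the `homicide_by_year` frame computed above.

```python
import pandas as pd

# Hypothetical yearly homicide counts, for illustration only.
counts = pd.Series({1993: 24000, 1994: 23300, 1995: 21600, 1996: 19650})

# Year-over-year percent change quantifies how steep the decline is.
yoy = (counts.pct_change() * 100).round(1)
print(yoy)
```

Applied to `homicide_by_year['Count']`, this would show exactly which years contributed most to the post-1995 decline.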

Next, let's take a look at the trend of homicides over certain months and make an observation and inference based on our visualization.

Homicide Frequency Over Specific Months :¶

In [42]:
# Group homicides by Month for analysis.
homicide_by_month = homicides_df.reset_index().groupby(by='Month')['Record ID'].count().reset_index()
homicide_by_month.rename(columns={'Record ID': 'Count'}, inplace=True)

# Inculcate sorting mechanism for Month column and sort.
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
homicide_by_month['Month'] = pd.Categorical(homicide_by_month['Month'], categories=months, ordered=True)
homicide_by_month.sort_values(by='Month', inplace=True)

# Define X-values and Y-values.
x_val = homicide_by_month['Month']
y_val = homicide_by_month['Count']

# Creating a DataFrame from month and count.
data = {'Month': x_val, 'Count': y_val}
homicide_by_month = pd.DataFrame(data)

# Creating the chart of Months vs Count of Homicides.
fig = px.scatter(homicide_by_month, x='Month', y='Count', title='Homicide frequency by Month')

# Customizing layout and style of the chart.
fig.update_layout(
    xaxis=dict(title='Month', tickfont=dict(size=12, color='black')),
    yaxis=dict(title='Count of Homicides', tickfont=dict(size=12, color='black')),
    title=dict(text='Homicide frequency by Month', font=dict(size=24, family='Arial')),
    font=dict(family='Arial', size=14, color='black'),
    paper_bgcolor='rgba(255,255,255,0.7)',
    plot_bgcolor='rgba(240,240,240,0.7)',
)

# Show the chart.
fig.show()

Let's make an observation based on our visualization :¶

There is a clear seasonal trend in the number of homicides: they peak in the summer months (June through September) and reach a low point in February. Interestingly, there is a decrease from August to November, indicating possible seasonal variation in violent incidents across these months.

Let's make an inference based on our visualization :¶

The monthly variation in the number of homicides indicates a clear trend: there are more homicides in June through September and fewer homicides in February. There could be a number of reasons for the higher rates over the summer, including more people engaging in outdoor activities, more social interactions, and possible dispute escalation. All of these things could lead to an increase in violent occurrences. Furthermore, in some areas, warmer weather is frequently associated with greater crime rates. Conversely, the colder weather and fewer outdoor activities may have contributed to February's lower homicide rates by reducing the likelihood of confrontation or criminal activity. The cyclical nature of these oscillations is further highlighted by the observed drop from August to November, which implies that there may be a dip in the elements that lead to increased violence as summer gives way to fall.
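A simple way to express this seasonality would be a seasonal index: each month's count relative to the overall monthly mean, where 100 represents an average month. The values below are hypothetical stand-ins; in the notebook, `homicide_by_month` would supply the real counts.

```python
import pandas as pd

# Hypothetical monthly counts, for illustration only.
monthly = pd.Series({'February': 44000, 'June': 54000,
                     'July': 56000, 'September': 53000})

# Seasonal index: each month's count relative to the overall monthly mean.
seasonal_index = (monthly / monthly.mean() * 100).round(1)
print(seasonal_index)
```

An index above 100 flags the elevated summer months, while February's value below 100 quantifies its dip.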

Finally, let's take a look at the trend of homicides over the different seasons and make an observation and inference based on our visualization.

Homicide Frequency by Season¶

In [43]:
# Define a helper for creating a Year-Month key for season-wise analysis.
def year_month_conversion(obv):
    return str(obv['Year']) + '-' + str(obv['Month'])

# Create new Year-Month column.
homicides_df['Year-Month'] = homicides_df.apply(year_month_conversion, axis=1)

# Define function to create Season bins.
def season_map(year_month):
    year, month = year_month.split('-')
    if month in ['December', 'January', 'February']:
        return 'Winter'
    elif month in ['March', 'April', 'May']:
        return 'Spring'
    elif month in ['June', 'July', 'August']:
        return 'Summer'
    elif month in ['September', 'October', 'November']:
        return 'Autumn'
    else:
        return 'None'

# Apply the function to create a new 'Season' column
homicides_df['Season'] = homicides_df['Year-Month'].apply(season_map)

# Group homicides by Season for analysis.
homicide_by_season = homicides_df.reset_index().groupby(by='Season')['Record ID'].count().reset_index()
homicide_by_season.rename(columns={'Record ID': 'Count'}, inplace=True)

# Define X-values and Y-values.
x_val = homicide_by_season['Season']
y_val = homicide_by_season['Count']

# Creating a DataFrame from seasons and count.
data = {'Season': x_val,
        'Count': y_val}

chart_df = pd.DataFrame(data)

# Creating the chart of Seasons vs Count of Homicides.
fig = px.box(chart_df, x='Season', y='Count', points='all', title='Homicide frequency by Season')

# Customizing layout and style of the chart.
fig.update_layout(
    title=dict(text='Homicide frequency by Season', font=dict(size=24, family='Arial')),
    xaxis=dict(title='Season', tickfont=dict(size=12, color='black')),
    yaxis=dict(title='Count of Homicides', tickfont=dict(size=12, color='black')),
    font=dict(family='Arial', size=14, color='black'),
    paper_bgcolor='rgba(255,255,255,0.7)',
    plot_bgcolor='rgba(240,240,240,0.7)',
)

# Update the traces in the figure to customize box plot points.
fig.update_traces(
    boxpoints='all',
    jitter=0.5,
    marker=dict(color='rgb(128, 0, 128)', opacity=0.8)
)

# Show the chart.
fig.show()

Let's make an observation based on our visualization :¶

An observation that we can make here is that the highest number of homicides occur in the summer and the lowest number of homicides occur in the winter.

Let's make an inference based on our visualization :¶

Why might this be the case? There might be more homicides that occur in the summer because of the temperature difference. Better weather might spur people to be outside more often and perpetrators have an easier time finding victims. Another reason could be that perpetrators have more free time during the summer, giving them ample time and freedom to commit these crimes.
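To check that the seasonal differences are more than noise, a chi-square goodness-of-fit test against a uniform "no seasonality" expectation could be applied to the season totals. The counts below are hypothetical stand-ins for `homicide_by_season`.

```python
from scipy import stats

# Hypothetical season totals (Winter, Spring, Summer, Autumn),
# for illustration only.
observed = [148000, 157000, 172000, 160000]

# Goodness-of-fit against a uniform "no seasonality" expectation.
chi2, p = stats.chisquare(observed)
print(f'chi2 = {chi2:.1f}, p = {p:.3g}')
```

With counts this large, almost any deviation from uniformity is statistically significant, so the effect size (e.g., a seasonal index) matters more than the p-value itself.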

Question of Interest #4 :¶

Correlation between perpetrator age vs. victim age. Which generation would be the most affected by homicides? Is there any generation gap between the victim and the perpetrator?

Correlation between perpetrator age and gender vs. victim age and gender. Which generation would be the most affected by homicides? Is there any generation gap between the victim and the perpetrator?

First, let's take a look at the correlation between victim age and perpetrator age and make an observation.

In [44]:
# Ages are numeric, so we can correlate them directly (get_dummies is unnecessary here).
correlation_Age = homicides_df[['Victim Age', 'Perpetrator Age']].corr().loc['Victim Age', 'Perpetrator Age']
print(f"Correlation between Victim Age and Perpetrator Age: {correlation_Age}")
Correlation between Victim Age and Perpetrator Age: 0.32134999505895767
In [45]:
dummy_df_sex = pd.get_dummies(homicides_df[['Victim Sex', 'Perpetrator Sex']])
correlation_sex = dummy_df_sex.corr()

# Create dummy variables for Victim Age and Perpetrator Age
dummy_df_age = pd.get_dummies(homicides_df[['Victim Age', 'Perpetrator Age']])
correlation_age = dummy_df_age.corr()

# Create a matrix of both correlations
correlation_matrix = pd.concat([correlation_sex, correlation_age], keys=['Sex', 'Age'])

print("Correlation Matrix:")
print(correlation_matrix)
Correlation Matrix:
                            Victim Sex_Female  Victim Sex_Male  \
Sex Victim Sex_Female                1.000000        -0.998798   
    Victim Sex_Male                 -0.998798         1.000000   
    Perpetrator Sex_Female           0.000628        -0.000684   
    Perpetrator Sex_Male             0.071369        -0.070751   
Age Victim Age                            NaN              NaN   
    Perpetrator Age                       NaN              NaN   

                            Perpetrator Sex_Female  Perpetrator Sex_Male  \
Sex Victim Sex_Female                     0.000628              0.071369   
    Victim Sex_Male                      -0.000684             -0.070751   
    Perpetrator Sex_Female                1.000000             -0.370149   
    Perpetrator Sex_Male                 -0.370149              1.000000   
Age Victim Age                                 NaN                   NaN   
    Perpetrator Age                            NaN                   NaN   

                            Victim Age  Perpetrator Age  
Sex Victim Sex_Female              NaN              NaN  
    Victim Sex_Male                NaN              NaN  
    Perpetrator Sex_Female         NaN              NaN  
    Perpetrator Sex_Male           NaN              NaN  
Age Victim Age                 1.00000          0.32135  
    Perpetrator Age            0.32135          1.00000  

Next, let's see which generation is affected the most.

In [46]:
bins = [0, 18, 35, 55, 75, 100]
labels = ['Gen Z', 'Millennials', 'Gen X', 'Baby Boomers', 'Silent Generation']
homicides_df['Age Group'] = pd.cut(homicides_df['Victim Age'], bins=bins, labels=labels, right=False)
age_group_counts = homicides_df['Age Group'].value_counts()

A bar plot will help us visualize this.

In [47]:
plt.bar(age_group_counts.index, age_group_counts.values)
plt.title('Homicides by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Number of Homicides')
plt.show()

Is there any generation gap between the victim and the perpetrator?

In [48]:
homicides_df['Perpetrator Age'] = pd.to_numeric(homicides_df['Perpetrator Age'], errors='coerce')
homicides_df['Age Difference'] = homicides_df['Perpetrator Age'] - homicides_df['Victim Age']
homicides_df['Age Difference'] = scipy.stats.mstats.winsorize(homicides_df['Age Difference'], limits=[0.01, 0.01])
In [49]:
plt.hist(homicides_df['Age Difference'].dropna(), bins=20, edgecolor='black')
plt.title('Age Difference Between Victim and Perpetrator')
plt.xlabel('Age Difference')
plt.ylabel('Frequency')
plt.show()

Let's make an observation based on the correlation :¶

The dataset's victim and perpetrator ages show a moderate positive relationship, with a correlation coefficient of about 0.32. The positive sign means that the age of the perpetrator tends to increase along with the age of the victim, and vice versa. A coefficient of this size indicates that the two ages move together only partially, so other factors likely play a larger role in shaping the association between the ages of homicide victims and perpetrators, and it's important to remember that correlation does not imply causation.
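Reporting a p-value alongside the coefficient would strengthen this observation; `scipy.stats.pearsonr` returns both. The sketch below runs it on synthetic ages with a built-in moderate association, for illustration only; the report's ~0.32 comes from the real `Victim Age` and `Perpetrator Age` columns.

```python
import numpy as np
from scipy import stats

# Synthetic ages with a built-in moderate positive association.
rng = np.random.default_rng(0)
victim_age = rng.integers(15, 60, size=5000).astype(float)
perp_age = 0.4 * victim_age + rng.normal(20, 10, size=5000)

r, p = stats.pearsonr(victim_age, perp_age)
print(f'r = {r:.2f}, p = {p:.2g}')
```

With hundreds of thousands of rows, even a modest correlation is highly significant, so the p-value mainly rules out chance, not unimportance.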

Question of Interest #5 :¶

For the crimes that were solved, which agency types were the most effective? Which states’ agencies were the best at solving the crimes? Which were the worst?

In [50]:
# Listing all the agency type
homicides_df['Agency Type'].value_counts()

# Creating a data frame that counting case solved or case not solved by the agency type
crime_solved_by_agency = homicides_df.groupby('Agency Type')['Crime Solved'].value_counts().unstack()
crime_solved_by_agency
Out[50]:
Crime Solved          No     Yes
Agency Type
County Police       7471   15035
Municipal Police  156648  333447
Regional Police       49     179
Sheriff            22143   81972
Special Police       825    2028
State Police        2510   11657
Tribal Police          4      56
In [51]:
data = dict(
    character = crime_solved_by_agency.index,
    parent = ["Agency Type", "Agency Type", "Agency Type", "Agency Type", "Agency Type", "Agency Type", "Agency Type"],
    value= crime_solved_by_agency['Yes']
)

fig = px.sunburst(
    data,
    names='character',
    parents='parent',
    values='value',
)
fig.show()
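Raw counts favor large agencies, so the question of which agency type is most effective is better answered by solve rate. The sketch below recomputes the percentages from the table above; in the notebook, the same division can be applied to `crime_solved_by_agency` directly.

```python
import pandas as pd

# Counts copied from the crime_solved_by_agency table above.
agency = pd.DataFrame(
    {'No': [7471, 156648, 49, 22143, 825, 2510, 4],
     'Yes': [15035, 333447, 179, 81972, 2028, 11657, 56]},
    index=['County Police', 'Municipal Police', 'Regional Police', 'Sheriff',
           'Special Police', 'State Police', 'Tribal Police'])

# Solve rate as a percentage; raw counts alone would favor large agencies.
agency['Solved %'] = (agency['Yes'] / (agency['Yes'] + agency['No']) * 100).round(1)
print(agency['Solved %'].sort_values(ascending=False))
```

By this measure, Tribal Police (93.3%) and State Police (82.3%) solve the largest share of their cases, though the very small caseloads of some agency types make their rates noisy.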
In [52]:
# Creating a data frame that counts cases solved or cases not solved by the state
crime_solved_by_agency = homicides_df.groupby('State')['Crime Solved'].value_counts().unstack()
crime_solved_by_agency.head()
Out[52]:
Crime Solved     No    Yes
State
Alabama        2338   8871
Alaska          296   1316
Arizona        3617   9106
Arkansas       1099   5765
California    36326  62902
In [53]:
# Sum up the total case numbers for each state
crime_solved_by_agency['Total cases'] = crime_solved_by_agency[['No','Yes']].sum(axis=1)
crime_solved_by_agency.head()
Out[53]:
Crime Solved     No    Yes  Total cases
State
Alabama        2338   8871        11209
Alaska          296   1316         1612
Arizona        3617   9106        12723
Arkansas       1099   5765         6864
California    36326  62902        99228
In [54]:
# Calculating the percentage for case solved for each state
solved_percentage = crime_solved_by_agency['Yes']/crime_solved_by_agency['Total cases']*100
print(solved_percentage.sort_values(ascending=False).head(10))
print('\n\n')
print(solved_percentage.sort_values(ascending=True).head(10))
State
North Dakota      93.114754
Montana           92.736486
South Dakota      92.093023
South Carolina    90.720317
Idaho             90.299824
Wyoming           90.192926
West Virginia     89.505148
Maine             89.277389
Vermont           88.059701
Iowa              87.089337
dtype: float64



State
District of Columbia    34.225352
New York                54.059123
Maryland                59.420122
Illinois                61.140419
Massachusetts           63.254418
California              63.391381
Missouri                63.702347
New Jersey              63.931624
Connecticut             66.392262
Michigan                66.723296
dtype: float64

Question of Interest #6 :¶

What kind of weapons did certain age groups prefer to use? What kind of weapons did the different genders given use?

In [55]:
age_bins = [1, 10, 20, 30, 40, 50, 60, 70, 80]
age_labels = ['1-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']
homicides_df['Perpetrator Age Group'] = pd.cut(homicides_df['Perpetrator Age'], bins=age_bins, labels=age_labels, right=False)
weapon_counts = homicides_df.groupby(['Perpetrator Age Group', 'Weapon']).size().unstack(fill_value=0)
weapon_by_age = weapon_counts.idxmax(axis=1)
weapon_counts_age = weapon_counts.max(axis=1)
weapon_age_df = pd.DataFrame({'Age Group':weapon_by_age.index, 'Weapon':weapon_by_age.values, 'Count': weapon_counts_age.values})
print(weapon_age_df.sort_values(by='Count', ascending=False))
  Age Group   Weapon  Count
2     21-30  Handgun  79002
3     31-40  Handgun  41159
1     11-20  Handgun  38340
4     41-50  Handgun  21808
5     51-60  Handgun  10985
6     61-70  Handgun   5127
7     71-80  Handgun   1185
0      1-10  Handgun    132
In [57]:
# Create a horizontal bar chart of the most common weapon per age group
colors = sns.color_palette("viridis", len(weapon_age_df))
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(x='Count', y='Age Group', data=weapon_age_df, palette=colors)
plt.xlabel('Count')
plt.ylabel('Age Group')
plt.title('Weapon Counts by Age Group')
for index, value in enumerate(weapon_age_df['Count']):
    bar_plot.text(value, index, f'{value:,}', ha='left', va='center', color='black')

plt.show()
In [59]:
homicides_df['Perpetrator Sex'].unique()
weapon_counts_sex = homicides_df.groupby(['Perpetrator Sex', 'Weapon']).size()
weapon_counts_df = weapon_counts_sex.reset_index(name='Count')
plt.figure(figsize=(14, 10))
sns.barplot(x='Weapon', y='Count', hue='Perpetrator Sex', data=weapon_counts_df, palette='deep')
plt.title('Weapon Counts by Perpetrator Sex')
plt.xlabel('Weapon')
plt.ylabel('Count')
plt.xticks(rotation=30, ha='right')
plt.legend(title='Perpetrator Sex')
plt.show()

Question of Interest #7 :¶

What makes some areas more likely to have homicides, and can we use predictive analysis to find out which neighborhoods or city blocks are most at risk?

In [61]:
# Engineer feature for the target variable
homicides_df['HighRisk'] = homicides_df['Victim Count'].apply(lambda x: 1 if x > 0 else 0)

# Select features
features = ['City', 'Weapon', 'Relationship', 'Victim Age', 'Perpetrator Age', 'Victim Race']

# Encode categorical columns
category_cols = ['City', 'Weapon', 'Relationship', 'Victim Race']
encoders = {}  # to store encoders for later use
for f in category_cols:
    encoder = LabelEncoder()
    homicides_df[f] = encoder.fit_transform(homicides_df[f])
    encoders[f] = encoder

# Split data
X_train, X_test, y_train, y_test = train_test_split(homicides_df[features], homicides_df['HighRisk'], test_size=0.2, random_state=0)

# Impute missing values
imputer = SimpleImputer(strategy='mean')  # You can choose a different strategy
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Train model (random forest used here)
model = RandomForestClassifier(random_state=0)
model.fit(X_train_imputed, y_train)

# Evaluate predictions
preds = model.predict(X_test_imputed)

# Print evaluation metrics
accuracy = accuracy_score(y_test, preds)

print('Accuracy:', accuracy)
Accuracy: 0.9210125783683609
In [62]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Generate confusion matrix
cm = confusion_matrix(y_test, preds)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Low Risk', 'High Risk'], yticklabels=['Low Risk', 'High Risk'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
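Accuracy alone can be misleading if `HighRisk` is imbalanced: a model that always predicts the majority class can look strong. The sketch below shows that baseline with synthetic labels in which 92% of cases are positive (an assumed split, chosen only to mirror the 0.921 accuracy above; the real split comes from `y_test.value_counts()`).

```python
import numpy as np

# Synthetic labels with an assumed 92% positive rate, for illustration only.
y_true = np.array([1] * 920 + [0] * 80)
baseline_preds = np.ones_like(y_true)  # always predict "high risk"

baseline_acc = (baseline_preds == y_true).mean()
print(f'Majority-class baseline accuracy: {baseline_acc:.2f}')  # 0.92
```

If the classes are this skewed, the model's 0.921 accuracy is barely above the do-nothing baseline, and per-class precision/recall from `sklearn.metrics.classification_report` would be more informative.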
In [65]:
import matplotlib.pyplot as plt
import seaborn as sns

# Get feature importances
feature_importances = model.feature_importances_
feature_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
feature_df = feature_df.sort_values(by='Importance', ascending=False)

# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_df, palette='viridis')
plt.title('Feature Importance')
plt.show()

Conclusions :¶

This project provides a detailed analysis of homicides, revealing critical trends and patterns that contribute to understanding the socio-economic and demographic dynamics influencing these incidents. The data visualization and statistical techniques applied have successfully highlighted significant factors contributing to homicide rates, including regional disparities, temporal patterns, and underlying societal influences.

The insights gained from this study can serve as a foundation for further research and policy formulation aimed at reducing homicide rates. By addressing key factors such as poverty, education, and law enforcement efficiency, stakeholders can develop targeted interventions to foster safer communities.

This analysis underscores the importance of data-driven approaches in tackling complex societal challenges and emphasizes the need for continuous data collection and analysis to monitor progress and adapt strategies accordingly.